Abstract


Open environments present an attention 
management challenge for conversational 
systems. We describe a kiosk system 
(based on Ravenclaw/Olympus) that uses
simple auditory and visual information to 
interpret human presence and manage the 
system’s attention. The system robustly 
differentiates intended interactions from 
unintended ones at an accuracy of 93% 
and provides similar task completion rates 
in both a quiet room and a public space. 


1 Introduction 


Dialog system designers try to minimize disruptive
influences by introducing physical and behavioral
constraints to create predictable environments.
This includes using a close-talking microphone
or limiting interaction to one user at a
time. But such constraints are difficult to apply
in public environments such as kiosks (Bohus and 
Horvitz, 2010; Foster et al., 2012; Nakashima et 
al., 2014), in-car assistants (Kun et al., 2007; Hof- 
mann et al., 2013; Misu et al., 2013) or on mo- 
bile robots (Haasch et al., 2004; Sabanovic et al., 
2006; Kollar et al., 2012). To implement dialog 
systems that operate in public spaces, we have to 
relax some of these constraints and deal with addi- 
tional challenges. For example, the system needs 
to select the correct interlocutor, who may be only 
one of several possible ones in the vicinity, then 
determine whether they are initiating the process 
of engaging with the system. 

In this paper we focus on the problems of 
identifying a potential interlocutor in the environ- 
ment, engaging them in conversation and provid- 
ing suitable channel-maintenance cues (Bruce et 
al., 2002; Fukuda et al., 2002; Al Moubayed and 
Skantze, 2011). We address these problems in the 
context of a simple application, a kiosk agent that 
accepts tasks such as taking a message to a named
recipient. To evaluate the effectiveness of our
approach we compared the system’s ability to
manage conversations in a quiet room and in a public
area.

Figure 1: Ravenclaw/Olympus augmented with
multimodal input and output functions.

The remainder of this paper is organized as fol- 
lows: we first describe the system architecture, 
then present the evaluation setup and the results, 
then review related work and finally conclude with 
an analysis of the study. 


2 System Architecture 


Figure 1 shows the architecture; it incorporates 
Ravenclaw/Olympus (Bohus et al., 2007) stan- 
dard components (in white), new components (in 
black) and modified ones (shaded). In the system 
pipeline, the Audio Server receives audio from a 
microphone, endpoints it and sends it to the ASR 
engine (PocketSphinx); the decoding is passed to
NLU (Phoenix parser). ICE (Input Confidence
Estimation; Helios) assigns confidence scores to
the input concepts. Based on the user’s input and
the context, the Dialog Manager (DM) determines
what to do next, perhaps using data from the
Domain Reasoner (DR). An Interaction Manager
(IM) initiates a spoken response using the Natural
Language Generation (NLG) and Text-to-Speech
(TTS) components.

Figure 2: Face states; some are animations.

Three components were added: (1) Multimodal
Capture acquires audio and human position data
using a Kinect device;¹ (2) Awareness determines
whether there is a potential interlocutor in
the vicinity and their current position, using
skeletal and azimuth information; (3) Talking Head
conveys the system’s state (as shown in Figure 2):
whether it is active (conversing and hinting)
or idle (asleep and doze) and whether focused
concepts are grounded (conversing and
non-understanding); certain state representations (e.g.,
conversing) are coordinated with the TTS
component.
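The face states listed above can be written down as a small enumeration. This is only an illustrative sketch: the state names come from the text, but the class name `FaceState`, the helper functions, and the choice to treat grounding as undefined outside the conversing and non-understanding states are assumptions, not part of the described system.

```python
from enum import Enum
from typing import Optional


class FaceState(Enum):
    """Talking Head states named in the text (identifiers are hypothetical)."""
    ASLEEP = "asleep"
    DOZE = "doze"
    HINTING = "hinting"
    CONVERSING = "conversing"
    NON_UNDERSTANDING = "non-understanding"


# Grouping taken from the text: active = conversing, hinting; idle = asleep, doze.
ACTIVE_STATES = {FaceState.CONVERSING, FaceState.HINTING}
IDLE_STATES = {FaceState.ASLEEP, FaceState.DOZE}


def is_active(state: FaceState) -> bool:
    """True for the states the text lists as active."""
    return state in ACTIVE_STATES


def is_grounded(state: FaceState) -> Optional[bool]:
    """Grounding signal conveyed by the face.

    Assumption: conversing signals grounded concepts, non-understanding
    signals ungrounded ones, and other states carry no grounding signal.
    """
    if state is FaceState.CONVERSING:
        return True
    if state is FaceState.NON_UNDERSTANDING:
        return False
    return None
```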


3 Evaluation 


A robust system should be able to function as well 
in a difficult situation as in a controlled one. We 
compare the system’s performance in two
environments, public and quiet, and evaluate (a) the
system’s awareness of intended users and (b) its
end-to-end performance.

The same twenty subjects participated in both
experiments: a mix of American, Indian, Chinese
and Hispanic participants with different levels of
English fluency. None of them had interacted with
this system before the study.

¹ See http://www.microsoft.com/en-us/Kinectforwindows/develop/.
Three sources are tapped: the beam-formed audio, the sound
source azimuth and skeleton coordinates. Video data are not used.

The subjects were told that they would interact 
with a virtual agent displayed on a screen. Their 
task for the awareness experiment was to make the 
agent aware that they wished to interact. For the 
end-to-end system performance, the task was to 
instruct the agent to send a message to a named 
recipient. 


3.1 Situated Awareness 


We define situated awareness as correctly engaging
the intended interlocutor (i.e., verbally
acknowledging the user’s presence) under two
conditions: when the user is positioned (i) inside
the visual range of the Kinect, at LOC-0 in
Figure 3(a), and (ii) outside the visual range of the
Kinect, at LOC-1 in Figure 3(a). We used the
effective range of the camera’s documented horizontal
field of view (57°), hereafter referred to as its
cone-of-awareness.

We conducted the awareness experiment in a 
public space, a lounge at a hub connecting mul- 
tiple corridors. The area has tables and seating, 
self-serve coffee, a microwave oven, etc. The
experiment was conducted during regular hours,
between 10am and 6pm on weekdays. During these
times we observed occupants discussing projects, 
preparing food, making coffee, etc. No direct at- 
tempt was made to influence their behavior and we 
believe that they made no attempt to accommo- 
date our activities. Accordingly, the natural sound 
level in the room varied in unpredictable ways. To 
supplement naturally-occurring sounds, we played 
audio of a conversation between two humans, an 
extract from the SwitchBoard corpus (Graff et al., 
2001). It was played using a loudspeaker placed at 
LOC-2 in Figure 3(a). The locations (0, 1, and 2) 
are all 1.5m from the Kinect, which we deemed to 
be a comfortable distance for the subjects. LOC- 
1 and LOC-2 are 70° to the left and right of the 
Kinect, outside its cone. 
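The geometry of this setup can be checked with a few lines of arithmetic. The sketch below assumes a sensor-centred coordinate frame (x lateral, z forward) and a sign convention in which LOC-1 is at -70° and LOC-2 at +70°; neither convention is given in the text, and the function names are hypothetical. The 57° field of view and the 1.5m distance are taken from the text.

```python
import math

HORIZONTAL_FOV_DEG = 57.0               # documented horizontal field of view (from the text)
HALF_CONE_DEG = HORIZONTAL_FOV_DEG / 2  # 28.5 degrees either side of the forward axis
DISTANCE_M = 1.5                        # all three locations are 1.5m from the Kinect

# Assumed sign convention: negative = left of the sensor, positive = right.
LOCATION_ANGLES_DEG = {"LOC-0": 0.0, "LOC-1": -70.0, "LOC-2": 70.0}


def position(distance_m: float, angle_deg: float) -> tuple:
    """(x, z) coordinates of a point at the given distance and bearing."""
    a = math.radians(angle_deg)
    return (distance_m * math.sin(a), distance_m * math.cos(a))


def bearing_deg(x: float, z: float) -> float:
    """Bearing of a point from the sensor's forward axis, in degrees."""
    return math.degrees(math.atan2(x, z))


def in_cone(angle_deg: float) -> bool:
    """True when a bearing falls inside the cone-of-awareness."""
    return abs(angle_deg) <= HALF_CONE_DEG


for name, angle in LOCATION_ANGLES_DEG.items():
    x, z = position(DISTANCE_M, angle)
    print(name, round(x, 2), round(z, 2), in_cone(bearing_deg(x, z)))
```

Under these assumptions only LOC-0 falls inside the cone, since 70° is well beyond the 28.5° half-angle.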

To detect the presence of an intended user, we
built an awareness model that uses three sensory
streams, viz. voice activity, skeleton, and sound
source azimuth. This model relies on the
coincidence of the azimuth angle and the skeleton angle
(along with voice activity) to determine the
presence of an intended user. We compare the
proposed model with two baselines: (1) conventional
voice-activity detection (VAD): once speech is
detected, the system responds as if a conversation has
been initiated; and (2) skeleton plus VAD: once
the skeleton appears in front of the Kinect and a
voice is heard, the system engages in conversation.

Figure 3: (a) Plan of Public Space (lounge); (b) Plan of Quiet Room (lab). Dark circled markers indicate
locations (LOC-0, LOC-1, LOC-2), discussed in the text.

Condition | Voice | +Skeleton | +Azimuth
Outside the cone | 28% | - | 93%
Inside the cone | - | 25% | 93%

Table 1: Accuracy for the Awareness Detection
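The proposed model and the two baselines described above can be sketched as three decision functions over one observation. This is a minimal illustration, not the system's implementation: the `Frame` record, the function names, and in particular the 10° tolerance used to decide that the azimuth and skeleton angles coincide are all assumptions, since the text gives no threshold.

```python
from dataclasses import dataclass
from typing import Optional

ANGLE_TOLERANCE_DEG = 10.0  # hypothetical; the text does not state a tolerance


@dataclass
class Frame:
    """One observation of the three sensory streams (hypothetical record)."""
    voice_active: bool                   # voice-activity detector output
    skeleton_angle_deg: Optional[float]  # None when no skeleton is in view
    azimuth_deg: Optional[float]         # sound-source azimuth, None if unknown


def vad_only(f: Frame) -> bool:
    """Baseline 1: engage as soon as speech is detected."""
    return f.voice_active


def vad_plus_skeleton(f: Frame) -> bool:
    """Baseline 2: engage when speech is heard and a skeleton is in view."""
    return f.voice_active and f.skeleton_angle_deg is not None


def vad_skeleton_azimuth(f: Frame) -> bool:
    """Proposed model: the speech must also come from the skeleton's direction."""
    if not vad_plus_skeleton(f) or f.azimuth_deg is None:
        return False
    return abs(f.azimuth_deg - f.skeleton_angle_deg) <= ANGLE_TOLERANCE_DEG


# A silent person stands in front of the sensor while the distractor
# loudspeaker plays from 70 degrees to the side.
distracted = Frame(voice_active=True, skeleton_angle_deg=0.0, azimuth_deg=70.0)
print(vad_only(distracted), vad_plus_skeleton(distracted), vad_skeleton_azimuth(distracted))
```

In this example both baselines would engage on the distractor audio, while the coincidence check rejects it because the sound does not come from where the skeleton is.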

Table 1 shows the combinations of sensory
streams we used under the two conditions. For the
outside-the-cone condition, the participants stand
at LOC-1, as shown in Figure 3(a), and follow the
instructions from the agent. Initially, the
subject’s skeleton is invisible to the agent; however,
the subject is audible to the agent. Therefore, with
certain combinations of sensors (e.g., the voice +
skeleton model and the voice + skeleton
+ azimuth model) the system attempts to guide
them to move in front of it, i.e., to LOC-0, an
ideal position for interacting with the system. For
the inside-the-cone condition, subjects stand at
LOC-0, where the agent can sense their skeleton.

When the user stands at LOC-1, i.e., outside the
cone, the voice + skeleton model and the
voice + skeleton + azimuth model
are functionally the same, since the source of
distraction has no skeleton in the cone. When the
user stands at LOC-0, i.e., inside the cone, voice
alone is the same as the voice + skeleton
model, since the agent always sees a skeleton in
front of it. Therefore, these redundant variants were not run.

We treated awareness detection as a binary
decision. An utterance is classified either as
“intended” or “unintended”. We manually labeled each
utterance as “intended” if it was directed at the
system and “unintended” otherwise. Accuracy
on “intended” speech is reported in Table 1.
Within each condition, the order of the
experiments with different awareness strategies was
randomized.
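One way to read "accuracy on intended speech" is as the fraction of manually labeled intended utterances on which the system engaged. The sketch below computes that quantity; the function name is hypothetical, the interpretation is an assumption, and the four utterances in the example are made up purely to show the arithmetic (they are not data from the study).

```python
def intended_accuracy(labels, engaged):
    """Fraction of utterances labeled "intended" that the system engaged with.

    labels  -- manual labels, each "intended" or "unintended"
    engaged -- parallel booleans, True where the system treated the
               utterance as addressed to it
    """
    outcomes = [e for label, e in zip(labels, engaged) if label == "intended"]
    if not outcomes:
        raise ValueError("no utterances labeled 'intended'")
    return sum(outcomes) / len(outcomes)


# Made-up example: three intended utterances, two of which were engaged.
example_labels = ["intended", "unintended", "intended", "intended"]
example_engaged = [True, True, False, True]
print(intended_accuracy(example_labels, example_engaged))
```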


We observe that the voice + skeleton +
azimuth model proves to be robust in the
public space. Its performance is significantly better
than that of the baselines in both conditions,
t(38) = 8.1, p < 0.001. This result agrees
with previous research (Haasch et al., 2004; Bohus
and Horvitz, 2009) showing that a fusion of
multimodal features improves performance over a
unimodal approach. Our result indicates that a
simple heuristic approach, using minimal visual and
audio features, provides usable attention
management in open environments. This approach helped
the system handle a complex interaction scenario
such as out-of-cone speech directed to the
system. If the speaker is out of range but is producing
possibly system-directed utterances, the system urges
them to step to the front. We believe it can be
extended to other complex cases by introducing
additional logic.


3.2 End-to-End System Performance 


To investigate the effect of the environment, we
compare the system’s performance in the public space
and the quiet room. The average noise level in the
quiet room is about 47 dB(A), with computers as
the primary source of noise. The background
sound level in the public space was 46 dB; other
natural sources ranged up to 57 dB. The audio
distractor measured 57 dB. The same ASR acoustic
models and processing parameters were used in
both environments. The participant stood at
LOC-0 in Figure 3(a) during the public space
experiment and at LOC-0 in Figure 3(b) during the
quiet room experiment. In both experiments, LOC-0 is
1.5m away from the system. We used the voice +
skeleton + azimuth model to discriminate
user speech from distractions in the environment.

Metric | Public Space | Quiet Room
Success Ratio | 15/20 | 16/20
Avg # Turns | 14.2 | 16.4
Concept Acc | 67% | 68%

Table 2: Public Space vs Quiet Room Performance

We gave each participant a randomized series
of message-sending tasks, e.g. “send a message
to (person) who is in room (number)”. Subjects
had a maximum of 3 minutes to complete each task;
each task required 7 turns. The number of tasks
completed (over the group) is reported in terms of
task “success-ratio”. Table 2 shows the
success-ratio of the task, the average number of turns
needed to complete the task, and the system’s
per-utterance concept accuracy (Boros et al., 1996).
There were no statistically significant differences
between the quiet room and the public space
(t(38) < 2, p > 0.5 on every metric). We conclude that
the channel maintenance technique we tested was
equally effective in both environments.
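The text does not say which test produced the t(38) statistics. The reported 38 degrees of freedom are consistent with a pooled-variance two-sample t-test over two groups of 20 (20 + 20 - 2 = 38), so the sketch below shows that computation under that assumption. The function name is hypothetical and the three-element samples are made up only to exercise the arithmetic; they are not measurements from the study.

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """Student's two-sample t statistic with pooled variance.

    Returns (t, degrees_of_freedom). Assumes independent samples with
    equal variances; this is one test consistent with the reported
    degrees of freedom, not necessarily the one used in the paper.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


# Made-up samples, used only to show the call.
t_stat, df = pooled_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(round(t_stat, 4), df)

# With twenty observations per environment the degrees of freedom are 38.
print(20 + 20 - 2)
```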


4 Related Work 


The problem of deploying social agents in public
spaces has been of enduring interest; Bohus and
Horvitz (2010) list engagement as a challenge for
a physically situated agent in open-world
interactions. But the problem was noted earlier and
solutions were proposed, e.g., a “push-to-talk” protocol
to signal the onset of intended user speech (Stent
et al., 1999). Sharp et al. (1997) and Hieronymus et
al. (2006) described the use of an attention phrase as
a required prefix to each user input. Although
explicit actions are effective, they need to be learned
by users. This may not be practical for systems in
public areas engaged by casual users.

A more robust approach involves fusing
several sources of information such as audio, gaze
and pose (Horvitz et al., 2003; Bohus and Horvitz,
2009; Hosoya et al., 2009; Nakano and Ishii,
2010). Previous work has shown that fusion
of different sensory information can improve
attention management. The drawback of such
approaches is the complexity of the sensor
equipment. Our work attempts to create the
relevant capabilities using a simple sensing device
and relying on explicitly modeled conversational
strategies. Others are also using the Microsoft
Kinect device for research in dialog. For example,
Skantze and Al Moubayed (2012) and Foster et
al. (2012) presented multiparty interaction
systems that use Kinect for face tracking and skeleton
tracking combined with speech recognition.

In our current work, we show that situational
awareness can be integrated into an existing
dialog framework, Ravenclaw/Olympus, that was not
originally designed with this functionality in mind.
The source code of the framework presented in
this work is publicly available for download,¹ as are
the acoustic models that have been adapted to the
Kinect audio channel.²


5 Conclusion 


We found that a conventional spoken dialog sys- 
tem can be adapted to a public space with mini- 
mal modifications to accommodate additional in- 
formation sources. Investigating the effectiveness 
of different awareness strategies, we found that a 
simple heuristic approach that uses a combination 
of sensory streams, viz. voice, skeleton and azimuth,
can reliably identify the likely interlocutor.
End-to-end system performance in a public space 
is similar to that observed in a quiet room, indi- 
cating that, at least under the conditions we cre- 
ated, usable performance can be achieved. This 
is a useful finding. We believe that on this level, 
channel maintenance is a matter of articulating a 
model that specifies appropriate behavior in dif- 
ferent states defined by a small number of dis- 
crete features (presence, absence, coincidence). 
We conjecture that such a framework is likely to 
be extensible to more complex situations, for ex- 
ample ones involving multiple humans in the en- 
vironment. 


¹ http://trac.speech.cs.cmu.edu/repos/olympus/tags/KinectOly2.0/

² http://trac.speech.cs.cmu.edu/repos/olympus/tags/KinectOly2.0/Resources/DecoderConfig/AcousticModels/Semi_Kinect.cd_semi_5000/